
Conversation

@NickLucche
Collaborator

@NickLucche NickLucche commented Mar 26, 2025

It appears the recompilation was due to a misconfiguration when pre-compiling for num_reqs.
Added tests to CI to confirm this upstream.

@NickLucche NickLucche marked this pull request as ready for review March 26, 2025 14:44
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@robertgshaw2-redhat
Collaborator

nice job nicolo!

Comment on lines 91 to 93
Member

MIN_NUM_SEQS is 8; should this be rounded up to the nearest multiple of MIN_NUM_SEQS?

Collaborator Author

Not necessarily; see capture_model: max_num_reqs still gets compiled anyway, due to how padding with an upper limit is implemented.

Collaborator Author

BTW, I'm OK with forcing everything to be nicely divisible by MIN_NUM_SEQS; I just remember that max_num_seqs used to be padded and then it was changed for a reason.
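
For reference, here is a minimal sketch of the rounding being discussed, assuming MIN_NUM_SEQS = 8 as stated above; the helper name is illustrative and does not correspond to an actual vLLM function.

# Illustrative sketch only: round a request count up to the nearest
# multiple of MIN_NUM_SEQS (8, per the comment above).
MIN_NUM_SEQS = 8

def round_up_to_multiple(num_reqs: int, multiple: int = MIN_NUM_SEQS) -> int:
    return ((num_reqs + multiple - 1) // multiple) * multiple

assert round_up_to_multiple(1) == 8
assert round_up_to_multiple(9) == 16
assert round_up_to_multiple(16) == 16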

@yarongmu-google
Contributor

It seems tests/tpu/test_compilation.py failed with this error:

Processed prompts: 100%|████████████████████████████████████████████████████| 1/1 [00:45<00:00, 46.00s/it, est. speed input: 109.60 toks/s, output: 0.11 toks/s]

  | WARNING 03-26 16:58:08 [parallel_state.py:1093] torch._C._host_emptyCache() only available in Pytorch >=2.5
  | FAILEDWARNING 03-26 16:58:08 [parallel_state.py:1093] torch._C._host_emptyCache() only available in Pytorch >=2.5

Source: https://buildkite.com/vllm/fastcheck/builds/18374#0195d341-8f23-4eb8-bd34-998e6055b2ff

@bvrockwell
Contributor


@NickLucche would we be able to set --enforce-eager=False on all tests except test_compilation.py, please, to capture whether recompilation is fully resolved?

@robertgshaw2-redhat
Collaborator

It’s false by default.

@bvrockwell
Contributor

It’s false by default.

https://github.com/NickLucche/vllm/blob/tpu-fix-recompilation/tests/v1/tpu/test_basic.py#L34

Am I reading this wrong? Where should I be checking this otherwise?

Collaborator

@NickLucche thank you for fixing this! Quick question: does this change actually fix the recompilation issue?

Collaborator

I have the same confusion. IMO, the code fixes the recompilation issue when max_num_reqs is not a power of 2, but in our tests it's already a power of 2.
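
To make the power-of-2 point concrete, here is a rough, purely illustrative sketch of that kind of bucketing; the helper is hypothetical and is not vLLM's actual padding code.

# Hypothetical illustration: a max_num_reqs that is already a power of 2
# (e.g. 16) maps to itself, while a non-power-of-2 value (e.g. 20) pads
# up to 32, which is the case the fix targets.
def next_power_of_2(n: int) -> int:
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

assert next_power_of_2(16) == 16  # already a power of 2, as in the tests
assert next_power_of_2(20) == 32  # non-power-of-2 value gets padded up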

Contributor

Does it make sense to also set enforce_eager=False in test_basic.py?

Collaborator

I think so. We should use the default value of enforce_eager (which is False) in most cases.

Contributor

Indeed, it is currently set to True in test_basic.py.
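
For reference, a minimal sketch of the kind of test setup being discussed, constructing vllm.LLM directly with enforce_eager left at its default of False; the model name, prompt, and surrounding test code are illustrative assumptions, not taken from the PR.

# Illustrative only: a tiny smoke test that leaves enforce_eager=False
# (the default) so the compiled path is exercised and recompilation
# issues can surface.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # example model, not from the PR
    enforce_eager=False,                 # default value, kept explicit here
    max_num_seqs=16,
)
outputs = llm.generate(["Hello, TPU!"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)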

@mergify mergify bot added the tpu Related to Google TPUs label Mar 27, 2025
@mergify

mergify bot commented Mar 27, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 27, 2025
Collaborator

@yaochengji yaochengji left a comment


@NickLucche Thanks for fixing this. Left a few comments.

Collaborator

Why don't we use torch.compile for this method?

It's usually easier to know the boundary of the TPU computation and avoid recompilation if we wrap it inside torch.compile.

Collaborator Author

Once main is stable we'll turn that on; I'd like to do that in a separate PR. Last time around, compilation got slower, so I just want to be cautious.

Collaborator

I did an experiment based on your branch, cleaning the XLA cache each time before execution:
without torch.compile: sampler pre-compilation time is 35.79 [secs]
with torch.compile: sampler pre-compilation time is 35.51 [secs]

The compilation time difference is negligible. Meanwhile, torch.compile can speed up execution, because the guard check of torch.compile is usually faster than torch/xla's graph tracing.
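
For context, a minimal sketch of what wrapping a sampler-like module in torch.compile could look like; the Sampler module and the choice of the openxla backend are assumptions for illustration, not vLLM's actual code.

# Minimal sketch, not vLLM's implementation: wrapping a TPU-side module in
# torch.compile makes the compiled region's boundary explicit, and repeated
# lazy-tensor tracing is replaced by torch.compile's guard checks.
import torch

class Sampler(torch.nn.Module):
    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # Greedy sampling as a stand-in for the real sampling logic.
        return torch.argmax(logits, dim=-1)

sampler = Sampler()
# On TPU via torch_xla this would typically use backend="openxla".
compiled_sampler = torch.compile(sampler, backend="openxla", dynamic=False)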

Collaborator Author

Happy to add it back then! I'd just like to merge this PR first, if you don't mind; it still fixes the pre-compilation when max_num_reqs is not a power of 2.

Collaborator

Could you add it here in this PR? It's just a one-line code change.


Collaborator

We'd better add xm.mark_step() before this line, unless we use torch.compile.

Collaborator Author

Isn't it redundant? A sync to CPU will still cause the graph to be flushed and executed.

Collaborator

TL;DR: xm.mark_step computes all the pending tensors, while B.cpu() only computes the exact tensor B.

E.g., say we have the code:

A = ...
B = op(A)

The graph outputs generated by xm.mark_step() and B.cpu() are different:

For xm.mark_step(), we get both A and B as outputs.
For B.cpu(), only B is the output.

Then if we have another xm.mark_step() later, since A's result was not returned by the previous computation, we have to compute A again.
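
A small sketch of the behaviour described above, assuming torch_xla is installed and an XLA (e.g. TPU) device is available; the tensors and ops are arbitrary examples.

# Illustration of the discussion above, not code from this PR.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

A = torch.randn(8, 8, device=device)  # lazy tensor, not yet computed
B = A * 2                             # op(A), also pending

# B.cpu() compiles and runs a graph whose only output is B; A's result is
# not among that graph's outputs.
b_host = B.cpu()

# A later mark_step() materializes everything still pending, so A ends up
# being computed here rather than coming out of the first graph.
xm.mark_step()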

Collaborator Author

Thanks for explaining, this is really helpful.
I don't think this is the case here, as we only need the sampled output tokens from the sampler step, but I could also add it for completeness.

Collaborator

Although we don't intend to use the other tensors, a later xm.mark_step will still try to get their results, so we end up with redundant computation.

BTW, we have an implicit xm.mark_step when using torch.compile. That's one reason I recommend using torch.compile when possible.

Collaborator

And when we use torch.compile, we don't need so many xm.mark_step calls.

torch.compile is also much easier for PyTorch developers to understand.

@NickLucche
Collaborator Author

@NickLucche thank you for fixing this! Quick question: does this change actually fix the recompilation issue?

So I've added the test you mentioned in the previous PR, test_sampler.py, with enforce_eager=False, and you can run it with VLLM_XLA_CHECK_RECOMPILATION=1.
If you did the same thing prior to this PR, it would show that a graph compilation was detected at runtime.

Mentioned it in this thread too #15309.
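
For illustration, roughly how a check along these lines can be expressed with torch_xla's debug metrics; this sketch is not vLLM's actual VLLM_XLA_CHECK_RECOMPILATION implementation, and the metric handling is an assumption.

# Rough sketch only; NOT vLLM's implementation of VLLM_XLA_CHECK_RECOMPILATION.
# It just illustrates asserting that no new XLA graph compilations happen
# after warm-up.
import torch_xla.debug.metrics as met

def num_compilations() -> int:
    data = met.metric_data("CompileTime")
    # metric_data returns None if nothing was recorded yet; otherwise its
    # first element is the number of recorded samples.
    return 0 if data is None else data[0]

baseline = num_compilations()
# ... run the already-warmed-up model / sampler here ...
assert num_compilations() == baseline, "unexpected XLA recompilation at runtime"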

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
@NickLucche NickLucche force-pushed the tpu-fix-recompilation branch from 957f043 to 2e61e31 Compare March 27, 2025 13:03
@mergify mergify bot removed the needs-rebase label Mar 27, 2025
@NickLucche
Collaborator Author

NickLucche commented Mar 27, 2025

Another test worth running, which I had mentioned on Slack but not here, is a benchmark with the recompilation check on.

VLLM_XLA_CHECK_RECOMPILATION=1 VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
 --disable-log-requests \
 --port 8004 \
 --gpu-memory-utilization 0.95 \
 --max-num-seqs 512 \
 --max-num-batched-tokens 512 \
 --tensor-parallel-size 1 \
 --max-model-len 2048 > "$VLLM_LOG" 2>&1 &

Feel free to ping me if you spot any case where the check fails though!

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 27, 2025
@mgoin mgoin enabled auto-merge (squash) March 27, 2025 14:11
@DarkLight1337
Member

Can you push a dummy commit to retry the doc build?

Signed-off-by: NickLucche <nlucches@redhat.com>
@NickLucche
Collaborator Author

thanks for the ping @DarkLight1337

Contributor

@bvrockwell bvrockwell left a comment


Thanks for looking into this @NickLucche! @yaochengji recommended this yesterday and it's a good idea: could we please disable enforce_eager for all tests (like test_basic.py, where it is currently still set to True), except for test_compilation.py? This ensures all tests check for recompilation, so nothing slips in inadvertently in the future.

@NickLucche
Collaborator Author

Sure, but it's a bit out of scope for this PR; I'd suggest we open another one.
Here I only meant to fix the issues related to max_num_reqs when compiling on main, plus add a test (test 7) for it.

Collaborator

@yaochengji yaochengji left a comment


Thanks for your fix again. Could you address my comments in this PR?


@mgoin mgoin merged commit 4098b72 into vllm-project:main Mar 27, 2025
32 checks passed
Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: NickLucche <nlucches@redhat.com>
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
Signed-off-by: NickLucche <nlucches@redhat.com>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>